Abstract

Statistical relational learning techniques have been successfully applied in a wide range of relational 
domains. In most of these applications, the human designers capitalized on their background knowledge 
by following a trial-and-error trajectory, where relational features are manually defined by a human 
engineer, parameters are learned for those features on the training data, the resulting model is validated, 
and the cycle repeats as the engineer adjusts the set of features. This paper seeks to streamline application 
development in large relational domains by introducing a light-weight approach that efficiently evaluates 
relational features on pieces of the relational graph that are streamed to it one at a time. We evaluate 
our approach on two social media tasks and demonstrate that it leads to more accurate models that are 
learned faster. 

1 Introduction 

Many machine learning applications involve inherently multi-relational domains in which entities of
heterogeneous types engage in a variety of relations. The statistical relational learning (SRL) [8] community
has introduced representations that provide principled support for learning and reasoning in such
multi-relational data. In a nutshell, SRL models use an expressive relational language, such as first-order logic or
SQL, to define relational features capable of capturing salient aspects of the structure of the domain. These 
relational features come with a parameterization, such that, once instantiated, they define a graphical model 
over which probabilistic inference can be performed. SRL techniques have been successfully applied in 
domains as diverse as biology, natural language processing, ontology alignment, social networks, and the 
web. 

The basic design cycle of many such applications follows a trial-and-error trajectory, where relational 
features are manually defined by a human engineer, parameters are learned for those features on the training 
data, the resulting model is validated, and the cycle repeats as the human engineer adjusts the set of features. 
For example, this is the strategy recommended in the Markov logic network tutorial [5]¹ (slide 108) that
comes with the widely used Alchemy software [17]. Typically, as a result, a relatively small number of
relational features are identified and used. This approach is appealing because background knowledge about 
the domain can be easily encoded in the intuitive relational language used in the SRL model at hand, and 
the relative strengths of these features can be learned through the parameterization. 

1 Available under "Tutorial" at http://alchemy.cs.washington.edu/

An alternative to this design cycle is to use structure learning, where both the relational features and
their accompanying parameterization are induced automatically from data. The state of the art in structure
learning has been advanced significantly in recent years [9, 14, 21, 1, 11, 15, 16, 13]. This work has resulted
in highly innovative approaches to identifying potentially promising structure candidates and efficiently
navigating through the large and complex search space for structures. So far, less emphasis has been placed
on how, once identified, structure candidates are evaluated. Existing techniques employ different flavors of
a batch evaluation procedure, where candidate structures are scored on the available data and retained in
the model if they improve the score. Crucially, all existing techniques presume that all of the data, or, at
least joins of pairs of relational tables lPT3l . can be stored in memory for the purposes of candidate structure 
evaluation. 

While the importance of existing structure learning approaches is undisputed, their focus on identifying 
promising candidate structures, coupled with the assumption of batch scoring from in-memory data, is not 
a good match to all application scenarios. In particular, the designers of many SRL applications skipped 
structure learning entirely and instead preferred the trial-and-error design cycle outlined above. There are 
at least two plausible reasons underlying their chosen approach. First, the applications frequently involved 
data sets that were simply too large to allow for batch scoring of candidate structures. To overcome this, 
parameters over hand-coded relational features have been trained in parallel [25], or on a stream of examples
[22]. Beyond the problem of storing large amounts of data in memory, in some domains, the data may
actually arrive as a stream and not be available all at once. 

Second, the designers of many applications often already have intuitions about good relational features, 
either from existing domain knowledge, such as in natural language or social network tasks, or from having
worked with the "raw" data and developed a representation for it that is already biased by what they
intuitively perceive as important aspects to capture. In such cases, where domain knowledge allows one 
to identify what variables are likely to influence each other, the design bottleneck is in pinning down the 
exact formulation of the influences among variables and eliminating features based on spurious intuitions. 
In other words, the problem of evaluating candidate features efficiently is at least as important as suggesting 
them. 

This paper seeks to streamline application development in relational domains by introducing an approach
that efficiently evaluates relational features on pieces of the relational graph that are streamed to it
one at a time. We call our approach RESOLWE, for Relational Structure Selection from Online Light-Weight
Evaluation. RESOLWE is agnostic to where the relational features it evaluates come from. They could be
discovered locally from each piece of the data that is being streamed using an existing structure discovery
approach, or provided by an application developer. In this paper we take a semi-automated approach where
the human designer specifies a declarative bias, e.g., H, by providing a grammar for the relational features, 
from which all possible features are generated. This corresponds to the scenario outlined above where general
intuitions about the domain are available, but determining the precise formulation of features requires
efficiently evaluating different versions of the model on a large relational data set. 

We implement our approach in the framework of Markov logic networks (MLNs) [26]. This choice is
motivated by the fact that MLNs have been widely used for relational application development. We start by
providing necessary background in Section 2. RESOLWE is described in Section 3. In Section 4, we flesh
out our proposed technique by using it to develop two applications in relational social media domains. Our 
results indicate that RESOLWE leads to significantly faster and more accurate learning. We conclude with a 
discussion of related and future work. 

2 Background, Notation, Assumptions 

First-order logic terminology. In first-order logic, relations are represented as predicates, such as
articleEdit(article, editor), which are Boolean functions with typed arguments. Assuming
that the domain contains no functions, a term is defined as a variable or a constant. An atom is a predicate
applied to terms, such as articleEdit(A, E). A positive/negative literal is a non-negated/negated atom.
A literal is grounded if all of its arguments are constants, or actual entities from the domain; conversely, a
literal is ungrounded if all of its arguments are variables. A formula consists of literals connected by
conjunction and disjunction. Formulas whose literals contain only constants are grounded, whereas formulas
whose literals contain only variables are ungrounded. We will refer to grounded formulas as groundings.

Learning Setting. We assume fully observable training data that consists of a large relational graph $\mathcal{G}$
whose hyperedges and nodes correspond to different relations and entity types, respectively. In practice,
$\mathcal{G}$ is typically too large to fit in memory all at once, and/or parts of it arrive as learning progresses. Thus,
we address a scenario where subgraphs of $\mathcal{G}$ arrive in a data stream $S$. We further assume a discriminative
setting where one or more relations in a set $V_t$ are designated as target predicates whose values are to be
predicted at test time, and the remaining relations $V_e$ are the evidence predicates whose values are observed
at test time.

Markov Logic Networks. A Markov logic network (MLN) [26] consists of a set of first-order logic
formulae $\mathcal{F}$, each of which has an associated weight. MLNs can be viewed as relational analogs to Markov
networks, in which the potential functions over cliques are defined by the groundings of the formulae in
$\mathcal{F}$. The role of first-order logic, therefore, is to provide a highly expressive language for specifying general
relational features.

Grounding the first-order logic formulae with a given set of entities results in a Markov network. In
particular, an MLN computes the conditional joint probability of a set of predicate groundings $X$ of the
target predicates $V_t$, given truth values for a set of evidence predicate groundings $Y$, as follows:

$$P(X = x \mid Y = y) = \frac{\exp\left(\sum_{f_i \in \mathcal{F}} w_i\, n_i(x, y)\right)}{\sum_{x'} \exp\left(\sum_{f_i \in \mathcal{F}} w_i\, n_i(x', y)\right)} \qquad (1)$$

Above, $X$ and $Y$ are the sets of all target and evidence predicate groundings, respectively; $x$ and $y$ are the
sets of corresponding truth assignments; $w_i$ is the weight associated with formula $f_i$; $n_i(x, y)$ is the number
of true groundings of formula $f_i$ on truth assignment $x, y$; and the denominator computes the normalizing
partition function $Z$ by summing over all truth assignments $x'$ to $X$.
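
As a concrete illustration of Equation (1), the following minimal Python sketch computes the conditional probability by brute-force enumeration over all truth assignments to the target groundings. The toy formulas, the weights, and the names `conditional_prob` and `count_fns` are assumptions made for illustration; they are not part of Alchemy, and enumeration of this kind is only feasible for a handful of groundings.

```python
import itertools
import math

def conditional_prob(x, y, weights, count_fns, num_targets):
    """P(X = x | Y = y) for a tiny ground MLN, by brute-force enumeration.

    weights[i] is w_i; count_fns[i](x, y) returns n_i(x, y), the number of
    true groundings of formula f_i under the joint assignment (x, y).
    """
    def score(assignment):
        return math.exp(sum(w * n(assignment, y)
                            for w, n in zip(weights, count_fns)))

    # Partition function: sum over every assignment x' to the target atoms.
    z = sum(score(xp)
            for xp in itertools.product([False, True], repeat=num_targets))
    return score(tuple(x)) / z

# Hypothetical toy model: one evidence atom y[0], two target atoms x[0], x[1].
count_fns = [
    lambda x, y: int(y[0] and x[0]),   # f_1: evidence atom AND first target
    lambda x, y: int(x[0] and x[1]),   # f_2: both targets true
]
weights = [1.5, -0.5]

p = conditional_prob((True, False), (True,), weights, count_fns, num_targets=2)
print(round(p, 4))
```

In practice the partition function is never enumerated like this; approximate inference such as MC-SAT is used instead.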

For ease of exposition, in this work we assume that all of the formulas in the MLN are conjunctions. 
This is not a restrictive assumption for the following reason. The most mature implementation of MLNs, 
Alchemy [17], handles arbitrary formulas by converting them into conjunctive normal form, as a conjunction
of disjunctions. Each disjunction produced in this way is then treated as a separate formula in the MLN, 
i.e., by viewing the MLN as a soft conjunction of disjunctions. Each disjunction in the MLN can then be 
converted to a conjunction by negating it, and also negating its weight. 
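
The last step can be checked numerically. A formula with weight $w$ and its negation with weight $-w$ induce the same distribution, because the two unnormalized scores differ only by the constant factor $e^{-w}$, which cancels in the normalization. The sketch below verifies this for a single clause over two Boolean atoms; the weight value and the function name `distribution` are illustrative assumptions.

```python
import itertools
import math

def distribution(weight, formula):
    """Distribution over two Boolean atoms (a, b) induced by one weighted formula."""
    worlds = list(itertools.product([False, True], repeat=2))
    scores = [math.exp(weight * int(formula(a, b))) for a, b in worlds]
    z = sum(scores)
    return {world: s / z for world, s in zip(worlds, scores)}

w = 0.8  # arbitrary illustrative weight

def disjunction(a, b):
    return (not a) or b        # the clause  (not a) OR b

def conjunction(a, b):
    return a and (not b)       # its negation:  a AND (not b)

p_clause = distribution(w, disjunction)
p_negated = distribution(-w, conjunction)

# Largest difference between the two distributions over all four worlds.
max_gap = max(abs(p_clause[world] - p_negated[world]) for world in p_clause)
print(max_gap)
```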



3 The RESOLWE Algorithm

Learning on the stream $S$ proceeds in a three-stage process:

1. The first $k_1$ sub-graphs that arrive are used to generate a set of relational features (in our case,
first-order logic MLN formulae) $\mathcal{F}$.

2. RESOLWE uses the next $k_2$ sub-graphs from $S$ to evaluate the formulae in $\mathcal{F}$, outputting a subset
$\mathcal{F}^* \subseteq \mathcal{F}$ consisting of the formulae that, together, give good predictive accuracy for the groundings
of $V_t$ given as observations the groundings of $V_e$.

3. The remainder of $S$ is used to train parameters on the formulae in $\mathcal{F}^*$.

In this paper, the set of formulas $\mathcal{F}$ in step 1 above is generated without requiring any data from the
stream (i.e., $k_1 = 0$) by using a declarative bias (in the spirit of EJ) to specify templates, from which
all possible formulas that comply with the bias are formed. Details on the templates used in each of our
domains are provided in Section 4. Our goal in using this semi-automatic procedure to generate $\mathcal{F}$ was to
take advantage of available background knowledge, while using systematic rule generation to make sure we
do not inadvertently place the baseline to which we compare at a disadvantage.

3.1 Criteria for Effective Formulae 

The goal of RESOLWE is to provide a light-weight formula evaluation strategy that can be carried out by
considering sub-graphs of the training data graph $\mathcal{G}$ that arrive in a stream one at a time. The key is to develop
criteria for what makes an effective formula and ensure that these criteria can be evaluated efficiently on $S$.
For this purpose, it is useful to rewrite each formula $F$ as $E \wedge Q$, where $E$ is the evidence sub-formula
and consists of all literals of $F$ with predicates in $V_e$, and $Q$ is the query sub-formula and consists of all
literals of $F$ with predicates in $V_t$.² We can view the roles of $E$ and $Q$ in $F$ as selector and enforcer
respectively: groundings of $F$ in which the corresponding grounding of $E$ is satisfied are "selected" and
a particular pattern, or configuration of truth values, specified by $Q$ is "enforced" over the truth values
of the corresponding grounding of the literals in $Q$. This is because a grounding of $F$ whose $E$ part is
false can never become true, regardless of the assignments made to its $Q$ part. Such groundings therefore
do not affect the values assigned to the ground literals of $Q$ and are safely ignored by inference.
In other words, because groundings only affect inference if their $E$ part
is satisfied, $E$ can be viewed as "selecting" those groundings.

This view of $F$ allows us to specify two criteria for its effectiveness: (1) among the groundings of the
literals in $Q$ selected by $E$, how uniformly do we observe the pattern that $Q$ enforces in the data; and
(2) how surprising is that pattern, i.e., how likely is one to observe the pattern in a randomly selected set
of groundings of the literals in $Q$. Intuitively, the uniformity criterion measures the correctness of $F$.
However, in the case of SRL models, we are equally interested in very correct formulas and very incorrect
ones, the latter getting negative weights during parameter training. The motivation for the second criterion
is that we would like to find relational features that capture aspects beyond those that can be captured by
simply using a prior over truth values.

Next, we make these criteria precise. Let $\mathcal{L}$ be a set of ungrounded literals and let $G_{\mathcal{L}}$ be a randomly
chosen grounding of $\mathcal{L}$. The joint assignment of truth values to the grounded literals in $G_{\mathcal{L}}$ is a random
variable $X_{\mathcal{L}}$ with $2^{|\mathcal{L}|}$ possible outcomes, i.e., if $\mathcal{L}$ contains a single literal, the possible outcomes are
{T, F}; if it contains two literals, the possible outcomes are {TT, TF, FT, FF}, etc.³ We are interested in
the probability distribution that governs $X_{\mathcal{L}}$. This distribution can be estimated empirically given observed
truth assignments for a set of groundings $\mathcal{X}$ of $\mathcal{L}$ by simple counting, as the proportion of times a particular
configuration of values is observed.

Definition 1. Let $P_{\mathcal{L}}^{\mathcal{X}}$ represent the empirical distribution over joint truth assignments to a randomly chosen
grounding of $\mathcal{L}$, as estimated on a set of groundings $\mathcal{X}$.
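
The counting estimate in Definition 1 can be sketched in a few lines of Python. The function name `empirical_distribution` and the toy observations are assumptions made for illustration; each observation is one grounding, given as a tuple with one truth value per literal.

```python
from collections import Counter

def empirical_distribution(groundings):
    """Estimate the distribution over joint truth assignments by counting.

    `groundings` is an iterable of tuples of booleans, one tuple per observed
    grounding of the literal set, one boolean per literal.
    """
    counts = Counter(tuple(g) for g in groundings)
    total = sum(counts.values())
    if total == 0:
        return {}  # no groundings observed, so no estimate is available
    return {config: c / total for config, c in counts.items()}

# Hypothetical observations for a set of two literals (Q1, Q2).
observed = [(True, True), (True, False), (False, False), (True, True)]
dist = empirical_distribution(observed)
print(dist)
```

Because the estimate depends only on counts, it can be accumulated incrementally as groundings arrive.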

Armed with this notation, we now go back to the view of a formula $F$ as consisting of a selector $E$ and
an enforcer $Q$ and consider the empirical distribution $P_{\mathcal{Q}}^{\mathcal{X}_E}$, where $\mathcal{Q}$ is the set of literals in $Q$, and $\mathcal{X}_E$ is
the set of groundings of $\mathcal{Q}$ selected by $E$, which is, in general, a subset of all possible groundings of $\mathcal{Q}$.
The first, uniformity, criterion identified above states that in an effective formula, $P_{\mathcal{Q}}^{\mathcal{X}_E}(Q)$, the probability,
according to $P_{\mathcal{Q}}^{\mathcal{X}_E}$, of observing the configuration of truth values enforced by $Q$, should be as extreme as
possible.

2 As described in Section 2, $F$ is assumed to be a conjunction.

3 Here, for simplicity, we are ignoring the case where, after grounding $\mathcal{L}$, some of the literals become identical.

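Criterion 1. Effective formulas maximize $\max\left(P_{\mathcal{Q}}^{\mathcal{X}_E}(Q),\; 1 - P_{\mathcal{Q}}^{\mathcal{X}_E}(Q)\right)$.

The second criterion states that $P_{\mathcal{Q}}^{\mathcal{X}_E}$ should be significantly different from the "default" $P_{\mathcal{Q}}^{\mathcal{X}_{All}}$, where
$\mathcal{X}_{All}$ is the set of all possible groundings of $\mathcal{Q}$. In other words, we are interested in formulas whose selector
$E$ homes in on groundings of $\mathcal{Q}$ for which the distribution over observed truth values deviates significantly
from the default distribution over all possible groundings of $\mathcal{Q}$.

Criterion 2. Effective formulas maximize $\mathrm{dist}\left(P_{\mathcal{Q}}^{\mathcal{X}_E},\; P_{\mathcal{Q}}^{\mathcal{X}_{All}}\right)$.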

Criterion 2 can be evaluated using standard measures of the distance between two distributions, such as 
KL divergence [2], or by methods that are specifically designed to detect significant deviations on a large 
scale, e.g., Q. 
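
The two criteria can be sketched as follows, using KL divergence as the distance for Criterion 2. The function names, the smoothing constant `eps`, and the example distributions are assumptions of this sketch rather than details taken from the paper.

```python
import math

def uniformity(p_enforced):
    """Criterion 1: max(p, 1 - p) for the probability of the enforced pattern."""
    return max(p_enforced, 1.0 - p_enforced)

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two distributions given as dicts over the same outcomes.

    `eps` guards against zero probabilities in q (an assumption of this sketch).
    """
    return sum(pv * math.log(pv / max(q.get(k, 0.0), eps))
               for k, pv in p.items() if pv > 0.0)

# Hypothetical distributions over a single target literal.
p_selected = {True: 0.7, False: 0.3}    # groundings selected by E
p_default = {True: 0.01, False: 0.99}   # all possible groundings (sparse domain)

print(uniformity(p_selected[True]))
print(kl_divergence(p_selected, p_default))
```

A formula whose selector picks out groundings that are mostly true scores well on both criteria here, since the default distribution is skewed towards false.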

3.2 Simplifying Assumption 

However, evaluating such measures may be expensive. In particular, to estimate $P_{\mathcal{Q}}^{\mathcal{X}_{All}}$, we need to enumerate
the observed joint truth values for all possible $n^k$ groundings of $\mathcal{Q}$, where $n$ is the number of entities in the
domain and $k$ is the number of distinct variables appearing in $\mathcal{Q}$. In general, this number is much higher
than the number of groundings selected by $E$. Instead, we note that relational domains are typically very
sparse, i.e., the number of relations actually observed to be true is typically much smaller than the total
number of possible relations that can form. Thus, rather than estimating $P_{\mathcal{Q}}^{\mathcal{X}_{All}}$ for different sets $\mathcal{Q}$ from the
data, we can make the approximately correct assumption that $P_{\mathcal{Q}}^{\mathcal{X}_{All}}$ will be skewed towards configurations
that involve false assignments to the literals. In essence, this means that we can assume the same skewed
default distribution $P_{\mathcal{Q}}^{\mathcal{X}_{All}}$ for all sets $\mathcal{Q}$ that contain $l$ literals. This assumption allows us to significantly
simplify the evaluation of rules according to the two criteria. Next, we describe how this is done for sets $\mathcal{Q}$ of
different sizes.

The simplest case is when $|\mathcal{Q}| = l = 1$, i.e., where the formula $F$ contains a single literal $Q_1$ of a
target predicate. Supposing that $Q_1$ is non-negated, $P_{\mathcal{Q}}^{\mathcal{X}_{All}}$ is a Bernoulli distribution, which, because of the
skew towards false assignments, has a very small probability of success. Thus, to satisfy criterion 2 and
maximize $\mathrm{dist}(P_{\mathcal{Q}}^{\mathcal{X}_E}, P_{\mathcal{Q}}^{\mathcal{X}_{All}})$, $P_{\mathcal{Q}}^{\mathcal{X}_E}$ needs to have a large probability of success. Combined with criterion 1,
we note that the only way both criteria may be satisfied is if $P_{\mathcal{Q}}^{\mathcal{X}_E}(Q_1)$ is maximized. Thus, when $|\mathcal{Q}| = 1$,
maximizing both criteria is as simple as requiring that the rule correctly identifies regions that contain high
proportions of true positives. The case when $Q_1$ is negated is symmetric.

The situation is a bit less straightforward when $|\mathcal{Q}| = l = 2$. For ease of exposition, let us assume that
$\mathcal{Q}$ contains two non-negated literals $Q_1$ and $Q_2$. Now, according to the sparsity assumption, $P_{\mathcal{Q}}^{\mathcal{X}_{All}}$ is skewed
towards truth assignments in which at least one of $Q_1$ or $Q_2$ is false. Thus, in order to satisfy criterion 2, we
need formulas for which $P_{\mathcal{Q}}^{\mathcal{X}_E}$ places significant mass on the case where $Q_1$ and $Q_2$ are both true. Combined
with criterion 1, this means that effective formulas are ones for which $P_{\mathcal{Q}}^{\mathcal{X}_E}(Q_1 \wedge Q_2)$ is large. Due to the
sparsity in relational domains, in practice formulas with such selectors $E$ are rare. A second way to address
criterion 2 is to look for formulas in which the conditional probability of one of $Q_1$ or $Q_2$ being true, given
that the other one is true, is surprisingly high. In terms of criterion 1, this translates into formulas for which
$P_{\mathcal{Q}}^{\mathcal{X}_E}(Q_1 \Rightarrow Q_2)$ is high. Thus, when $|\mathcal{Q}| = 2$, RESOLWE autonomously determines the precise formulation
of the enforcer $Q$ from the non-negated literals in $\mathcal{Q}$ by evaluating $P_{\mathcal{Q}}^{\mathcal{X}_E}(Q_1 \wedge Q_2)$, $P_{\mathcal{Q}}^{\mathcal{X}_E}(Q_1 \Rightarrow Q_2)$, and
$P_{\mathcal{Q}}^{\mathcal{X}_E}(Q_2 \Rightarrow Q_1)$ and choosing ones that result in high values. We note that the formulation $Q_1 \Rightarrow Q_2$ does
not contradict our assumption that $F$ is a conjunction. $Q_1 \Rightarrow Q_2$ can be expressed in conjunctive form as
$\neg(Q_1 \wedge \neg Q_2)$, thus $F = E \wedge \neg(Q_1 \wedge \neg Q_2)$. When all literals in $E$ are observed, the effect of $F$ is equivalent
to that of a formula $F' = E \wedge Q_1 \wedge \neg Q_2$, for which a negative weight is learned, despite the fact that $F$ and
$F'$ are not logically equivalent.

Algorithm 1 RESOLWE Algorithm
Input:  $\mathcal{F}$: set of formulas
        $V_t$: set of target predicates
        $V_e$: set of evidence predicates
        $S$: stream of training subgraphs
        $k_2$: number of streamed training subgraphs to use
        $\theta$: threshold
Output: $\mathcal{F}^*$: selected formulas
 1: for each of the next $k_2$ subgraphs $s \in S$ do
 2:   for each $F \in \mathcal{F}$ do
 3:     $E$ = sub-formula of $F$ consisting of literals with predicates in $V_e$
 4:     $\mathcal{Q} = \{Q_1, \ldots, Q_l\}$ = set of literals of $F$ with predicates in $V_t$
 5:     Compute $P_{\mathcal{Q}}^{\mathcal{X}_E}(Q_1 \wedge Q_2 \wedge \cdots \wedge Q_l)$
 6:     for each $k \in [1, l]$ do
 7:       Compute $P_{\mathcal{Q}}^{\mathcal{X}_E}(Q_k \mid \bigwedge_{i \in [1,l],\, i \neq k} Q_i)$
 8:     end for
 9:   end for
10: end for
11: for each $F \in \mathcal{F}$ do
12:   if Average($P_{\mathcal{Q}}^{\mathcal{X}_E}(Q_1 \wedge Q_2 \wedge \cdots \wedge Q_l)$) $> \theta$ then
13:     Add $E \wedge Q_1 \wedge Q_2 \wedge \cdots \wedge Q_l$ to $\mathcal{F}^*$
14:   end if
15:   for each $k \in [1, l]$ do
16:     if Average($P_{\mathcal{Q}}^{\mathcal{X}_E}(Q_k \mid \bigwedge_{i \in [1,l],\, i \neq k} Q_i)$) $> \theta$ then
17:       Add $E \wedge (Q_1 \wedge \cdots \wedge Q_{k-1} \wedge Q_{k+1} \wedge \cdots \wedge Q_l \Rightarrow Q_k)$ to $\mathcal{F}^*$
18:     end if
19:   end for
20: end for

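To summarize, in the general case, when $\mathcal{Q}$ consists of $l$ literals $Q_1, Q_2, \ldots, Q_l$, RESOLWE evaluates

$$P_{\mathcal{Q}}^{\mathcal{X}_E}(Q_1 \wedge Q_2 \wedge \cdots \wedge Q_l),$$

and for each $k \in [1, l]$

$$P_{\mathcal{Q}}^{\mathcal{X}_E}\Big(Q_k \,\Big|\, \bigwedge_{i \in [1,l],\, i \neq k} Q_i = \mathrm{true}\Big)$$

and selects formulations that have high probabilities.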

This process is summarized in Algorithm 1. Steps 6-8 and 15-19 are only evaluated if the formula has
more than one literal of a target predicate. The algorithm reduces the evaluation of the candidate formulas
to the collection of a few statistics for each formula that can be easily computed on a stream of examples.
Moreover, by taking advantage of the simplifying assumption, the algorithm avoids having to estimate $P_{\mathcal{Q}}^{\mathcal{X}_{All}}$.
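
A minimal Python sketch of Algorithm 1 is given below. It abstracts away grounding: the callback `selected_assignments` is a hypothetical stand-in that returns, for a formula and a subgraph, the joint truth assignments to the target literals over the groundings selected by $E$. Averaging each statistic only over the subgraphs in which it is defined is also an assumption of this sketch, not a detail stated in the paper.

```python
from collections import defaultdict
from itertools import islice

def resolwe_select(formulas, stream, selected_assignments, k2, theta):
    """Sketch of Algorithm 1.

    formulas: iterable of hashable formula identifiers.
    stream: iterable of subgraphs.
    selected_assignments(formula, subgraph): hypothetical callback returning a
        list of tuples of booleans, one tuple per grounding of the target
        literals selected by E in that subgraph.
    Returns a list of (formula, form) pairs, where form is "conj" or
    ("implies", k), meaning "the other literals imply Q_k" (k is 0-based).
    """
    formulas = list(formulas)
    stats = defaultdict(list)   # (formula, form) -> per-subgraph probabilities

    for subgraph in islice(stream, k2):                      # lines 1-10
        for f in formulas:
            rows = selected_assignments(f, subgraph)
            if not rows:
                continue        # E selects nothing here; no estimate available
            num_literals = len(rows[0])
            stats[(f, "conj")].append(sum(all(r) for r in rows) / len(rows))
            if num_literals < 2:
                continue
            for k in range(num_literals):
                # Groundings in which every literal other than Q_k is true.
                cond = [r for r in rows
                        if all(v for i, v in enumerate(r) if i != k)]
                if cond:
                    stats[(f, ("implies", k))].append(
                        sum(r[k] for r in cond) / len(cond))

    selected = []                                            # lines 11-20
    for key, values in stats.items():
        if sum(values) / len(values) > theta:
            selected.append(key)
    return selected

# Hypothetical toy stream with two subgraphs and two candidate formulas.
toy = {
    "s1": {"F1": [(True, True), (True, False)],
           "F2": [(False,), (False,), (True,)]},
    "s2": {"F1": [(True, True)], "F2": [(False,)]},
}
result = resolwe_select(["F1", "F2"], ["s1", "s2"],
                        lambda f, s: toy[s][f], k2=2, theta=0.4)
print(result)
```

Only running sums per formula need to be kept, so memory use is independent of the size of the full graph.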
4 Experimental Evaluation



In this section, we demonstrate the methodology proposed in Section 3 by developing applications in two
social media domains. We compare RESOLWE to a system (called SKIPSELECTION) that skips the second
step outlined at the beginning of Section 3 and directly trains weights on the formulas in the original set $\mathcal{F}$.
Because, for formulas with $|\mathcal{Q}| > 1$, RESOLWE automatically determines the correct logical connectives and
negations in the part of the formula that involves literals of the target predicate (i.e., lines 6-8 and 15-19 in
Algorithm 1), the set $\mathcal{F}$ over which weights were trained by SKIPSELECTION included all possible formulas
considered by RESOLWE. The goal of our experiments was (1) to determine whether more accurate models
are obtained with RESOLWE; and (2) to evaluate the relative efficiency of RESOLWE and SKIPSELECTION.

We implemented RESOLWE as part of the Alchemy system [17]. For weight training, we adapted
Contrastive Divergence [19] to learn from relational instances that arrive in a stream and used a Gaussian penalty
on the weights. We preferred this algorithm over other available methods because we are interested in an
efficient, light-weight approach. For inference, we used the MC-SAT algorithm [24]. Generation of formulas
from provided templates was implemented as a separate module in Python.

4.1 Data Sets 

The experiments were conducted in two social media domains: WikiCollabs and Delicious.

4.1.1 WikiCollabs

The task in this domain is to predict project-specific collaborations in Wikipedia.⁴ The data consists of
all 3,538 Wikipedia articles that appeared in the featured⁵ and controversial⁶ lists in the period Oct. 7-21,
2009. These articles are interesting because they are richly connected, both by their hyperlinks and by their
human network of editors [3]. For each article, we collected the editors who contributed to it, either by
directly editing the article, or by editing its "Talk," i.e., discussion, page. Only edits that were not marked
as "minor" by the editor were considered. In this way, we obtained a set of 280,068 editors. In addition,
we collected the hyperlinks among the articles. These articles are densely inter-linked, as indicated by the
large number of hyperlinks (45,006) among them. Wikipedia articles often refer to external resources on the
Web. Thus, for each article, we looked up the categorizations of each of its external references in the DMOZ
open directory.⁷ Because this information is not available for all URLs, we considered both exact matches
of URLs, for which there were about 0.9 per article, as well as exact matches for just the domain name
part of the URL, for which there were about 77 per article. An editor $E_1$ on Wikipedia can communicate
with another editor $E_2$ by editing $E_2$'s "Talk" page. There were a total of 7,874,985 instances of
communication between pairs of editors. The set of evidence predicates $V_e$ included articleEdit(article,
user) and articleTalk(article, user) for the two ways in which a user may contribute to
an article; userTalk(user, user); hyperLink(article, article); similar(article,
article), indicating that the cosine similarity between the tf/idf-weighted bag-of-words representation
of the stemmed text in the two articles is between 0.1 and 0.5; verySimilar(article, article),
indicating the similarity is greater than 0.5; category(article, category), providing the
category under which the article appeared on the featured or controversial list; and levelNExact(article,
externalCategory) and levelNInexact(article, externalCategory) for the different
levels N = 1, 2, 3 in the DMOZ hierarchy in which an external URL from an article is filed, either for exact
URL matches or for matches of just the domain name of the URL.

4 http://en.wikipedia.org/wiki/Main_Page
5 http://en.wikipedia.org/wiki/Wikipedia:Featured_lists
6 http://en.wikipedia.org/wiki/Wikipedia:List_of_controversial_issues
7 http://www.dmoz.org/

Table 1: Templates used to generate formulas in WikiCollabs.

EDIT(t1, u) ∧ SIMPLE_REL(t1, t2) ⇒ modifies(t2, u)        (2)
EDIT(t1, u) ∧ LONG_REL(t1, t2) ⇒ modifies(t2, u)          (3)
USER_REL(u1, u2) ∧ modifies(t, u1) ⇒ modifies(t, u2)      (4)
USER_REL(u1, u2) ∧ modifies(t, u1) ∧ modifies(t, u2)      (5)

The data from the WikiCollabs network was streamed in subgraphs $G_C$ that were centered around one of
the editors $C$. $G_C$ contains all articles $A_C$ to which $C$ is related via the articleEdit or articleTalk
predicates and all editors $E_C$, in addition to $C$, that contributed to any of the articles in $A_C$, as well as
the other articles to which the editors in $E_C$ contributed. Also included were any hyperlinks among the
included articles, any instances of userTalk relationships among the editors in $E_C$, and any available category
information on the included articles. The task was to learn to predict which editors in $E_C$ contribute to
the articles in $A_C$, given all other information. For convenience, we represented the relationship between
articles in $A_C$ and the users in $E_C$ by the modifies(article, user) predicate, which was the only
target predicate in $V_t$. Subgraphs were formed for users $C$ who made edits to the encyclopedia on at least
30 distinct days, had at least 30 collaborators, and edited at most 15 different articles. These restrictions
are motivated by the observation that collaborator suggestion is most needed by editors who are strongly
engaged with the encyclopedia, and so contribute to it over extended periods, but at the same time are
focused in their interests. In this way, we excluded users, such as the "60% of registered users [who] never
make another edit after their first 24 hours" [23], as well as users who help oversee the editing process and
are therefore somewhat superficially involved in large numbers of edits, from having subgraphs $G_C$ formed
around them. However, such users can still appear in the subgraph of another user. We obtained a total of
1785 subgraphs.

Formulas for the WikiCollabs task were generated from the templates shown in Table 1. Predicates
in all-caps are templates that get expanded in designer-specified ways, as shown in Table 2. As can be
seen from these expansions, the EDIT template captures the different ways in which an editor may be
related to an article, SIMPLE_REL and LONG_REL expand to the different ways in which two
articles may be related, and USER_REL expands to the different ways in which two users may be related.
The EDIT and SIMPLE_REL predicate templates were declared to be compounders, which means that
when they are expanded in a rule, they can be replaced by a conjunction of more than one of their possible
expansions. For example, in template (2) above, EDIT(t1, u) may be expanded to articleEdit(t1, u), or
articleTalk(t1, u), or articleEdit(t1, u) ∧ articleTalk(t1, u). We limited the length of compoundings
to at most 2. RESOLWE received only one version of the rules generated from templates (4) and (5), as it
determines the correct configuration of literals of target predicates automatically.
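
The expansion of compounder templates can be sketched in Python as follows. This is an illustrative reconstruction, not the authors' module: the function name `expand`, the string representation of formulas, and the reading of a compounding as an unordered conjunction of distinct expansions are all assumptions, and the sketch does not filter out combinations a designer might exclude, so its formula count need not match the one used in the experiments.

```python
from itertools import combinations, product

def expand(options, compounder=False, max_len=2):
    """All expansions of a predicate template.

    A compounder may be replaced by a conjunction of up to `max_len` distinct
    expansions; other templates are replaced by exactly one expansion.
    """
    sizes = range(1, max_len + 1) if compounder else [1]
    return [" ∧ ".join(combo)
            for size in sizes
            for combo in combinations(options, size)]

EDIT = ["articleEdit(t1, u)", "articleTalk(t1, u)"]
SIMPLE_REL = [
    "similar(t1, t2)", "verySimilar(t1, t2)",
    "hyperlink(t1, t2)", "hyperlink(t2, t1)",
    "category(t1, c) ∧ category(t2, c)",
]

# Template (2): EDIT(t1, u) ∧ SIMPLE_REL(t1, t2) ⇒ modifies(t2, u)
formulas = [f"{e} ∧ {r} ⇒ modifies(t2, u)"
            for e, r in product(expand(EDIT, compounder=True),
                                expand(SIMPLE_REL, compounder=True))]
print(len(formulas))
print(formulas[0])
```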
Table 2: Expansions for the predicate templates used in WikiCollabs.

EDIT(t1, u)         = {articleEdit(t1, u) | articleTalk(t1, u)}
SIMPLE_REL(t1, t2)  = {similar(t1, t2) | verySimilar(t1, t2) |
                       hyperlink(t1, t2) | hyperlink(t2, t1) |
                       category(t1, c) ∧ category(t2, c)}
LONG_REL(t1, t2)    = {level[1|2|3]Exact(t1, c) ∧ level[1|2|3]Exact(t2, c) |
                       level[1|2|3]Inexact(t1, c) ∧ level[1|2|3]Inexact(t2, c)}
USER_REL(u1, u2)    = {userTalk(u1, u2) |
                       articleEdit(t, u1) ∧ articleEdit(t, u2) |
                       articleTalk(t, u1) ∧ articleTalk(t, u2) |
                       articleEdit(t, u1) ∧ articleTalk(t, u2) |
                       articleTalk(t, u1) ∧ articleEdit(t, u2)}



4.1.2 Delicious 

The task in this data set is to predict user friendships on the Delicious social bookmarking site⁸ [29]. We used
the data collected by the authors of [29], which includes 425,486 instances of the "fan" relationship, which 
indicates that one user is a fan of another one, 446,879 instances of the "network" relationship, which is the 
inverse of "fan" (i.e., if A is a fan of B, then B is in A's network), and 48,809,570 instances of the ternary 
"tagging" relationship between a user, a tag, and a URL. Although the "fan" and "network" relationships 
are inverses of one another, the observations in the data were not complete. We completed them by treating 
them as a single "friendship" relationship. 

To stream this data, we formed subgraphs $G_C$, each of which was centered at one of the users $C$. The task
was to predict all friendships of $C$. Each $G_C$ included $C$'s actual friends $Fr_C$ as true positives, and, as true
negatives, a sampling of users who are friends with users from $Fr_C$. We formed subgraphs only for users $C$
with at least as many true negative friends as true positive friends.
The friendships between $C$ and the other users were hidden at test time, and the goal was to predict their
existence. However, friendships among users other than $C$ were observed. For convenience in distinguishing
between these two cases, we included an observed and an unobserved version of the friendship
relationship: an unobserved, i.e., target, cFriends(user) predicate indicating that the user is friends with
the implicit $C$, and an observed, i.e., evidence, friends(user, user) predicate indicating that the two
users are friends. For all users in $G_C$, we included observations about all URLs they bookmarked, along
with the tags used. Those were captured via the following predicates: bkMark(page, user, tag);
bkMarkAfter(page, user), which indicates that the user bookmarked the page at least one day
after it was bookmarked by $C$; bkMarkBefore(page, user) and bkMarkSameDay(page, user),
which provide analogous information for pages bookmarked before or on the same day as bookmarked by
$C$; usedTag(tag, user); and sameTag(user, user) and sameUrl(user, user) to indicate,
respectively, that two users (different from $C$) used the same tags and bookmarked the same URLs. We used
656 subgraphs constructed in this way.

8 http://www.delicious.com/

Table 3: Templates used to generate formulas in Delicious.

REL(u1) ⇒ cFriends(u1)                              (6)
LONG_REL(u1) ⇒ cFriends(u1)                         (7)
UREL(u1, u2) ∧ REL(u1) ⇒ cFriends(u2)               (8)
UREL(u1, u2) ∧ LONG_REL(u1) ⇒ cFriends(u2)          (9)
UREL(u1, u2) ∧ cFriends(u1) ⇒ cFriends(u2)          (10)
UREL(u1, u2) ∧ cFriends(u1) ∧ cFriends(u2)          (11)



Table 4: Expansions for the predicate templates used in Delicious 



REL(u1)      = {bkMarkAfter(p, u1) | bkMarkBefore(p, u1) | bkMarkSameDay(p, u1)}
LONG_REL(u1) = {usedTag(t, u1) ∧ usedTag(t, C) | bkMark(p, u1, t) ∧ bkMark(p, C, t)}
UREL(u1, u2) = {friends(u1, u2) | sameTag(u1, u2) | sameUrl(u1, u2)}



Formulas for the Delicious task were generated using the templates shown in Table 3. We used the
expansions shown in Table 4. The REL and LONG_REL templates expand to predicates that relate users to
the user C around whom the graph $G_C$ is centered via various bookmarking and tagging activities, whereas
UREL expands to different ways of relating two users. REL was declared a compounder, and UREL was
declared an extender, which meant that one or more possible expansions could be chained together. For
example, UREL(u1, u2) could be expanded in ways such as friends(u1, z1) ∧ sameUrl(z1, u2). We
allowed extensions and compoundings of length at most 2. As before, RESOLWE needs the expansion
from only one of the templates in lines 10 and 11, as it determines the specific formulation autonomously.
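
To make the chaining of an extender concrete, here is a minimal sketch that enumerates expansions of UREL(u1, u2) of length at most 2 and plugs them into the template in line 10. The string encoding of the expansions, the function names `extend` and `template_10`, and the naming of intermediate variables (z1, ...) are assumptions for this example, guided only by the expansion shown in the text. Compounding is not sketched.

```python
from itertools import product

# Expansions of UREL from Table 4, encoded as format strings (assumed encoding).
UREL = ["friends({a}, {b})", "sameTag({a}, {b})", "sameUrl({a}, {b})"]


def extend(expansions, a, b, max_len=2):
    """Chain up to `max_len` binary expansions from `a` to `b`.

    Consecutive literals in a chain share a fresh intermediate variable.
    """
    out = []
    for n in range(1, max_len + 1):
        # Variables along the chain: a, z1, ..., z(n-1), b
        chain = [a] + [f"z{i}" for i in range(1, n)] + [b]
        for combo in product(expansions, repeat=n):
            literals = [
                template.format(a=x, b=y)
                for template, x, y in zip(combo, chain, chain[1:])
            ]
            out.append(" ∧ ".join(literals))
    return out


def template_10():
    """Instantiate UREL(u1, u2) ∧ cFriends(u1) ⇒ cFriends(u2) for every expansion."""
    return [
        f"{body} ∧ cFriends(u1) ⇒ cFriends(u2)"
        for body in extend(UREL, "u1", "u2")
    ]
```

Under these assumptions, the three single expansions plus the nine chains of length 2 give twelve candidate bodies for UREL(u1, u2), one of which is friends(u1, z1) ∧ sameUrl(z1, u2).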



4.2 Methodology 

We performed four-fold cross-validation by splitting the subgraphs in the data randomly into 4 sets and 
performing 4 train/test runs, in each run withholding one of the folds for testing and training on the remaining 
three. We used $k_2 = 30$ and $\theta = 0.4$ in Algorithm 1. Before training weights, both for RESOLWE and
SKIPSELECTION, we included a clause consisting of a single literal of the target predicate. This is standard 
practice in MLN applications that enables the model to capture the bias towards false assignments by 
learning a negative weight on this single-literal clause. 
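
The fold construction described above can be sketched as follows. The function name `cross_validation_runs`, the fixed `seed`, and the round-robin assignment of shuffled subgraphs to folds are assumptions for this example; the text only states that the split was random.

```python
import random


def cross_validation_runs(subgraphs, k=4, seed=0):
    """Yield (train, test) lists for k-fold cross-validation over subgraphs."""
    items = list(subgraphs)
    random.Random(seed).shuffle(items)
    # Deal the shuffled subgraphs into k folds of (nearly) equal size.
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [g for j, fold in enumerate(folds) if j != i for g in fold]
        yield train, test
```

For the 656 Delicious subgraphs, this yields four runs with 164 test subgraphs and 492 training subgraphs each.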

The results are summarized using two standard metrics from the information retrieval literature [20]:

• (MAP) Mean average precision, which is identical to the area under the precision-recall curve. The
MAP score is computed over a set of test subgraphs $S$ as follows:

$$\mathrm{MAP}(S) = \frac{1}{|S|} \sum_{s \in S} \frac{1}{|R_s|} \sum_{r \in R_s} P@r.$$
Figure 1: Comparison between RESOLWE and SKIPSELECTION in terms of Mean Average Precision and
Area under the ROC curve in WikiCollabs (left) and Delicious (right). Observed differences are significant 
at the 0.001 level. 

Above, $R_s$ is the set of all possible (p, c) pairs, and the precision at $r$ is defined as

$$P@r = \frac{\text{Number of true positive pairs among the top } r}{r}.$$



• (AUC-ROC) Area under the ROC curve, which is identical to the mean average true negative rate.
This score is computed as follows:

$$\text{AUC-ROC}(S) = \frac{1}{|S|} \sum_{s \in S} \frac{1}{|R_s|} \sum_{r \in R_s} TN@r,$$

where the true negative rate at $r$ is defined as

$$TN@r = \frac{\text{Number of true negatives below position } r}{\text{Total number of true negatives}}.$$
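
The two metrics can be sketched in code as follows. This is a minimal illustration under two stated assumptions: each subgraph's predictions are given as a list of 0/1 labels already sorted from highest to lowest predicted score (ties are not handled), and the inner sums range over the rank positions of the true positives, which is the reading under which the formulas coincide with standard average precision and AUC. The function names are hypothetical.

```python
def average_precision(labels):
    """Mean of P@r over the ranks r at which true positives occur."""
    hits = 0
    total = 0.0
    for r, y in enumerate(labels, start=1):
        if y:
            hits += 1
            total += hits / r  # P@r: true positives among the top r, divided by r
    return total / hits if hits else 0.0


def auc_roc(labels):
    """Mean of TN@r over the ranks r at which true positives occur."""
    n_neg = sum(1 for y in labels if not y)
    n_pos = len(labels) - n_neg
    if n_pos == 0 or n_neg == 0:
        return 0.0  # degenerate case; returning 0.0 is an arbitrary choice here
    neg_below = n_neg
    total = 0.0
    for y in labels:
        if y:
            total += neg_below / n_neg  # TN@r: fraction of negatives ranked below r
        else:
            neg_below -= 1
    return total / n_pos


def mean_over_subgraphs(metric, rankings):
    """Average a per-subgraph metric over a set of test subgraphs S."""
    return sum(metric(labels) for labels in rankings) / len(rankings)
```

For the ranking [1, 0, 1, 0], `average_precision` gives (1/1 + 2/3) / 2 and `auc_roc` gives (2/2 + 1/2) / 2 = 0.75, which matches the fraction of positive-negative pairs that are ordered correctly.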



4.3 Results 

The results of our experiments are shown in Figure 1. All differences in this figure are significant at the
0.001 level according to a paired t-test. As can be seen, selecting formulas with RESOLWE before training
weights leads to significant improvements in both domains according to both metrics. Because the AUC-ROC
performance of a random predictor would be 0.5, we can see that using RESOLWE takes us from
near-random performance to significantly higher accuracy levels. Moreover, RESOLWE learns
significantly faster than SKIPSELECTION. Table 5 presents the number of minutes taken by
RESOLWE and by weight learning on dedicated Xeon 2.67GHz CPUs, averaged over the 4 folds in each
domain. In both domains, using RESOLWE leads to a dramatic decrease in training time.
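
The size of the decrease can be read off the totals in Table 5 with a short calculation. The dictionary layout and the function name `speedup` are assumptions for this example; the numbers are the reported totals in minutes.

```python
# Total training time in minutes (steps 2 and 3 combined), taken from Table 5.
TOTAL_MINUTES = {
    "WikiCollabs": {"RESOLWE": 185.16, "SKIPSELECTION": 3236.40},
    "Delicious": {"RESOLWE": 93.42, "SKIPSELECTION": 602.08},
}


def speedup(domain):
    """Ratio of SKIPSELECTION total time to RESOLWE total time."""
    times = TOTAL_MINUTES[domain]
    return times["SKIPSELECTION"] / times["RESOLWE"]


for domain in TOTAL_MINUTES:
    print(f"{domain}: {speedup(domain):.1f}x")
```

This works out to a ratio of roughly 17 in WikiCollabs and roughly 6 in Delicious.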

We note that our results in the Delicious domain are not comparable to those of Zhou et al. [29] because
their system uses global computations over all available data to arrive at predictions, whereas here we focus 
on making predictions using information local to subgraphs of the original relational graph. 



Table 5: Training time in minutes for completing steps 2 (formula selection) and 3 (parameter learning), as
outlined at the beginning of Section 3, averaged over 4 folds.

                        WikiCollabs                     Delicious
                  Step 2    Step 3     Total     Step 2    Step 3     Total
RESOLWE            94.01     91.15    185.16      30.66     62.76     93.42
SKIPSELECTION          -   3236.40   3236.40          -    602.08    602.08



5 Related Work

Structure learning and feature selection are important problems that have been widely studied in both
relational and i.i.d. settings. Most feature selection approaches, e.g., [10], have been developed for
non-streaming classification settings. One recent exception is the work of Wu et al. [28], who study a
classification task where the features arrive in a stream, while the data set is fixed. In contrast, here we
explore the setting where the pool of features is fixed, but the data arrives as a stream.

Closely related to this paper is work on structure learning of statistical relational models (9l [331 ETJ 
[TJ[TTJ[T5l[T6l[T3l. This literature has made important advances on focusing the search through the super- 
exponential space of candidate models, thus discovering more accurate candidates faster. Less emphasis 
has been placed on how to evaluate candidate structures and, in most existing work, evaluation has been 
carried out by computing a probabilistic score over candidate structures that, crucially, assumes that the 
training data is available in memory. In contrast, this paper addresses the complementary setting common 
in many relational applications where sufficient background knowledge is available to generate candidate 
structures, and the challenge is in how to efficiently evaluate them on data that is presented to the learner in 
a stream. The set-up explored here is probably most similar to that assumed by Huynh and Mooney [11],
where formula selection and parameter training are carried out in two separate stages. However, while that
previous work also employs an accuracy-based measure (that of Aleph [27]) to evaluate rule candidates, it
does not address the task of evaluating candidates that have more than a single literal of the target predicate 
and does not consider streaming the relational instances. 

A few authors have addressed learning of structure from data streams. Dries and De Raedt [6] introduced
an inductive logic programming technique that uses candidate elimination to learn theories from a stream
of examples. Their work applies to noise-free data. Recently, Kummerfeld and Danks [18] introduced a
"Temporal Difference Structure Learning" algorithm that learns causal structure from a data stream. This
algorithm targets causal discovery in graphical models and is not applicable to the relational setting assumed 
here. 

Learning from data streams in relational settings has so far focused on training the parameters of a model 
for which structure is provided (as done in SKIPSELECTION, described in Section 4). This approach was
adopted by Mihalkova and Mooney [22] and in upcoming work by Huynh and Mooney [12].

6 Conclusion 

We proposed an approach to streamlining application development in relational domains by efficiently
evaluating a set of candidate formulas on relational instances that are streamed one at a time. The evaluation
algorithm is derived from two natural criteria, and efficiency is achieved by exploiting the fact that typical
relational domains are sparse. We applied our approach to two large and noisy social media tasks,
demonstrating significant gains in the speed and accuracy of learning.

Avenues for future work include adapting this approach to tackle domains that experience gradual concept
drift. One way to do this is to interleave steps 2 and 3 outlined at the beginning of Section 3 and use a
decaying average of the statistics calculated by Algorithm 1. As soon as step 2 determines a change in the
structure over which weights are learned in step 3, the change is implemented, keeping the weights of the
remaining rules at their currently learned values, and the process continues. A second potential direction for
future work is exploiting other ways in which relational data may be streamed to the learner. For example,
one interesting setting arises when the learner is allowed to actively decide in what ways and how much to
grow the subgraphs $G_C$ around entities of interest C.
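
One possible form of the decaying average mentioned above is an exponentially weighted running mean, sketched below. This is only an illustration of the general idea: the class name `DecayingAverage`, the `decay` parameter, and the choice of an exponential scheme are assumptions, and the sketch treats each statistic as a single number per streamed subgraph rather than reproducing the statistics of Algorithm 1.

```python
class DecayingAverage:
    """Exponentially decaying average of a stream of per-subgraph statistics."""

    def __init__(self, decay=0.9):
        # `decay` close to 1 forgets slowly; close to 0 tracks recent data only.
        self.decay = decay
        self.value = None

    def update(self, x):
        """Fold in the statistic from the next streamed subgraph."""
        if self.value is None:
            self.value = x
        else:
            self.value = self.decay * self.value + (1.0 - self.decay) * x
        return self.value
```

Keeping one such average per candidate formula would let older subgraphs contribute less over time, so that a change in which formulas score well can surface as the stream drifts.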